-
Notifications
You must be signed in to change notification settings - Fork 20.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eth: reset miner, engine, and txpool when calling SetHead to prevent crashes #16713
Conversation
core/tx_pool.go
Outdated
@@ -286,6 +286,12 @@ func (pool *TxPool) loop() { | |||
if pool.chainconfig.IsHomestead(ev.Block.Number()) { | |||
pool.homestead = true | |||
} | |||
if pool.chain.CurrentBlock().NumberU64() < head.NumberU64() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to use ev.Block
instead of pool.chain.CurrentBlock()
? It's faster and should be the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, I just didn't realize they were the same. Updated.
Thank you for your contribution! Your commits seem to not adhere to the repository coding standards
Please check the contribution guidelines for more details. This message was auto-generated by https://gitcop.com |
63e6250
to
d10ec7e
Compare
Any additional feedback here? It looks like the CI tests are failing due a timeout unrelated to my changes. |
d10ec7e
to
217edb2
Compare
After running more of our smart contract tests against my fork, I realized Geth was still crashing in some cases after calling Rather than trying to track down every place in the code where we were holding onto some pointers to blocks, I decided to opt for a different approach. This new PR modifies I've updated my commit message and PR title to reflect the fact that the changes now lie in the "eth" package instead of the "core" package. I'm not sure if this covers everything that could be affected by |
On the one hand, this looks clean and the However, I'm a bit curious what happens with all existing channels and go-routines coupled to the old Could you try to e..g call |
f8db324
to
217edb2
Compare
@holiman Our tests call I assumed calling |
b.eth.blockchain.SetHead(number) | ||
b.eth.txPool = core.NewTxPool(b.eth.config.TxPool, b.eth.chainConfig, b.eth.blockchain) | ||
if b.eth.chainConfig.Clique != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At my company, we use a single node network for testing with Clique as the consensus engine. I'm not sure how to handle Ethash or other consensus engines here. Is that something we need to consider?
@holiman or @karalabe I believe I've answered your questions and responded to your comments. Do you have any additional feedback here? For the past 6-7 months at 0x, we've been running tests against our fork of Geth which includes these changes. We haven't experienced any major issues. Our test suite is extensive and hits the We have reached the point where we have some other developers in the 0x ecosystem using our fork of Geth and relying on the Edit: To clarify, the original issue that I wanted to fix with this PR appears to still exist (at least in some form; I'm not sure if there are still cases where Geth crashes or if it only hangs now). If we try running our test suite against the latest release (as of this comment, v1.8.21), Geth hangs/does not mine any new blocks and the tests are unable to complete. |
Commen from review discussion:
|
@holiman Thank you for the response. I agree with what you're saying in theory. Before I made any changes, I don't necessarily have the knowledge of Geth internals required to debug and fix issues with the chain re-org approach. This is especially true when you have complicated interactions between different parts of the codebase (e.g., as I described above Consider this: This PR addresses a specific, reproducible bug with Otherwise, if you really don't want to merge this PR as is, I will try my best to address your concerns and implement |
To clarify on the current status of this PR: It is currently not working due to merge conflicts/upstream changes. It will take some significant work for me to rebase and fix any issues, so I'm waiting for answers to my questions and some consensus on the approach before continuing. I'm happy to continue working on this PR. Just need some feedback first. |
Potentially fixed by #19216 in a cleaner way. That one reproduces the issue more deterministically, we might be able to fix this in the txpool fully isolated. |
We're trying to reproduce it and create an issue based on that. |
Superseded by #19308 |
This small change fixes a bug in tx_pool.go which causes geth to crash in certain cases when debug.setHead is used.
I discovered this bug while running a single-node, private PoA network for testing purposes. Here are the steps I took to consistently produce the crash:
--rpcapi 'miner,debug'
.I believe this is happening because when the
ChainHeadEvent
is received in tx_pool.go, the variablehead
refers to a block that no longer exists. We can detect this cause by checking if the block number ofhead
is greater than the current block number. And if that is the case, we can fix it by simply settinghead
to the current block.After making this change, I tested it manually using the steps described above and observed that geth no longer crashes and continues to mine starting from the new block number. I also ran
make test
to ensure that all the existing tests still pass (they do). However, there still could be some unintended side effects I'm not aware of.If we would like to add a test case for this, I would be happy to take a stab at it and include it in this PR. I just need some advice on where the test should live and how to get started on writing it.